KDE and Hotspot Workflow
Intro
This vignette shows how to run a Kernel Density Estimate (KDE) and
hotspot analysis using data retrieved with the
replicaToolkitR package. The goal of this document is not
to present a complete analysis but to showcase how easy it is to
implement such analyses using the data products created by
replicaToolkitR from Replica.
NOTE: This vignette uses data that comes from
replicaToolkitR; the user should have a basic understanding of how to acquire this data themselves using replicaToolkitR::query_network_trip_using_bbox().
Data Overview
The data used in this analysis was acquired using the study area bounding box seen below.
Note: Any trip travelling through this zone is queried via
replicaToolkitR. This is done to account for all vehicles traveling on study area network links. The full list of trip types queried is listed below:

- internal-to-internal: study area to study area trips
- internal-to-external: study area to external trips
- external-to-internal: external to study area trips
- external-to-external: external to external trips (through trips)

replicaToolkitR, however, filters for trips that originate within the study area polygon when displaying or aggregating anything regarding trip origins.
The below map details trip origins as counts aggregated by each trip's
first network link. Links in this case are displayed as points
determined by their starting coordinates. Normally, this data would be loaded
from the replica_trip_origin_links.gpkg file the user
creates when querying Google, but in this instance it is loaded from the
package.
data("replica_trip_origin_links_gpkg")

replica_trip_origin_links_fltrd = replica_trip_origin_links_gpkg %>%
  st_filter(study_area_network) %>%
  st_transform(crs = 32610)

KDE and Hotspot
Data Prep
The SpatialKDE and sfhotspot packages are
used to perform the KDE analysis. The latter is built upon the former
and provides some convenience wrappers if you don’t want to get
into the nitty-gritty of some of the function options. Both are used in
this vignette.
First, create a grid that covers the spread of the data. In this case, a cell size of 200 meters was chosen as it is roughly the size of a city block in Seattle and a good unit for this analysis.
grid_sm = create_grid_hexagonal(
  replica_trip_origin_links_fltrd
  ,cell_size = 200)

The water around the city is then removed using the
tigris package.
This is an important step, as it makes the distinction between a location that does not generate any trips and one incapable of generating trips - i.e. freight trips cannot start in the middle of Lake Washington.
grid_sm_water_removed = tigris::erase_water(grid_sm) %>%
  st_cast("POLYGON")

Both grids are displayed in the below map. The grid with removed water features is seen in the right panel:
As you can see, it is not perfect; Green Lake, Union Bay, and other water features have not been removed. In addition, this idea can be extended to other locations incapable of generating trips, such as parks, universities, airport tarmac, etc.
The R package mapedit provides functions that allow the
user to easily draw and create their own polygons, which can be used to remove
non-trip-generating areas in a manner similar to the water removal above. This vignette
does not cover this.
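For the curious, a minimal sketch of that workflow might look like the following. Note that mapedit::drawFeatures() opens an interactive map, so this cannot run unattended, and the object names here are illustrative assumptions:

```r
library(mapedit)
library(sf)
library(dplyr)

# Interactively draw polygons over areas that cannot generate trips
# (parks, campuses, tarmac, etc.); drawFeatures() returns an sf object.
drawn_areas = drawFeatures()

# Remove the drawn areas from the water-removed grid created above.
grid_sm_custom = grid_sm_water_removed %>%
  st_difference(st_union(st_transform(drawn_areas, 32610)))
```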
Analysis Execution
The below code runs the **hotspot_kde()** function for
different vehicle types in the data, using the different grids
previously created. The **purrr::pmap()** function allows
us to iterate over multiple arguments simultaneously, resulting in a list
object (and saves us a bunch of keystrokes!).
hotspot_object = list(
  list("MEDIUM_COMMERCIAL", "MEDIUM_COMMERCIAL", "HEAVY_COMMERCIAL", "HEAVY_COMMERCIAL")
  ,list(grid_sm_water_removed, grid_sm, grid_sm_water_removed, grid_sm)
) %>%
  pmap(~{
    temp_hotspot = replica_trip_origin_links_fltrd %>%
      filter(vehicle_type == .x) %>%
      hotspot_kde(
        data = .
        ,grid = .y
        ,weights = count
        ,bandwidth = 800) %>%
      mutate(kde_norm = gauntlet::normalize_min_max(kde)) %>%
      mutate(label = str_glue("KDE (Min-Max Norm): {dgt2(kde_norm)}<br>KDE (raw): {dgt2(kde)}<br>TTL Trips: {sum}")) %>%
      st_transform(crs = 4326)
  })

The resulting 2D kernel density estimates for trip origins of heavy (left panel) and medium (right panel) duty vehicles are displayed below:
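The kde_norm column above relies on gauntlet::normalize_min_max(); assuming this helper is a standard min-max rescaling, a base-R equivalent for readers without the gauntlet package would be:

```r
# Rescale a numeric vector to the [0, 1] range (min-max normalization).
normalize_min_max = function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

normalize_min_max(c(2, 4, 6))  # 0.0 0.5 1.0
```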
The results are pretty interesting - there is a rather dramatic difference between where heavy and medium duty trucks go in Seattle. Heavy duty trucks stick to the ports, whereas medium duty trucks are more distributed throughout Seattle (they’re smaller and can get around easier, makes sense!).
This analysis can be extended by using the hotspot_gistar() function, which allows us to easily calculate the Gi* (gi-star) Z-score statistic used in identifying clusters of point locations. The theory behind this statistic will not be covered here but should be reviewed before implementing it.
temp_hotspot = replica_trip_origin_links_fltrd %>%
  filter(vehicle_type == "MEDIUM_COMMERCIAL") %>%
  hotspot_gistar(
    data = .
    ,grid = grid_sm_water_removed
    ,weights = count
    ,bandwidth = 800)

The map below depicts the result of calculating the gi-star Z-score statistic. Statistically significant hot and cold locations are displayed:
temp_hotspot %>%
  filter(pvalue <= .01) %>%
  mutate(flag_gistar = case_when(gistar > 0~"Hot", T~"Cold")) %>%
  mapview(layer.name = "Hotspots", zcol = "flag_gistar")

By just using the default input options, we get some pretty interesting findings. We see rather large cold spots in Green Lake, Union Bay, and Discovery Park (west of Magnolia) - this makes sense, as the first two locations are bodies of water and the third is a large wooded park where freight trucks don’t go. In comparison to their neighbors, these locations are especially cold regions for trucks, and the gi-star statistic identifies this. One could take these findings and remove these areas in subsequent analysis iterations.
Considerations
This was a very simple vignette intended to highlight a specific
workflow stemming from data acquired from the
replicaToolkitR package.
For a more detailed analysis, one should consider:
- the full set of locations that should be excluded from the analysis
  - this will help prevent identifying spurious hot/cold locations
- the proper unit of measure for your grid
- the bandwidth used in the KDE and gistar calculations
- the kernel type used in the KDE and gistar calculations
  - the default quartic kernel was used but there are others that can be specified
- the neighbor distance used in the gistar calculations
  - this impacts whether a cell statistically varies from its neighbors
  - altering this will impact the size and shape of hot and cold spots

***
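On the kernel point, SpatialKDE::kde() exposes the kernel choice directly. A hedged sketch of swapping kernels follows; the "epanechnikov" kernel name and exact argument names are assumptions that should be checked against the SpatialKDE documentation:

```r
library(SpatialKDE)
library(dplyr)

# Recompute the medium-duty surface with an alternative kernel;
# the kernel name here is an assumption -- see ?SpatialKDE::kde.
medium_origins = replica_trip_origin_links_fltrd %>%
  filter(vehicle_type == "MEDIUM_COMMERCIAL")

kde_alt = kde(
  medium_origins
  ,band_width = 800
  ,kernel = "epanechnikov"   # default is "quartic"
  ,weights = medium_origins$count
  ,grid = grid_sm_water_removed)
```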